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Abstract. The standard approach to Bayesian inference is based on the assumption that 
the distribution of the data belongs to the chosen model class. However, even a small vio¬ 
lation of this assumption can have a large impact on the outcome of a Bayesian procedure. 
We introduce a simple, coherent approach to Bayesian inference that improves robustness to 
perturbations from the model: rather than condition on the data exactly, one conditions on 
a neighborhood of the empirical distribution. When using neighborhoods based on relative 
entropy estimates, the resulting “coarsened” posterior can be approximated by simply tem¬ 
pering the likelihood—that is, by raising it to a fractional power—thus, inference is often 
easily implemented with standard methods, and one can even obtain analytical solutions 
when using conjugate priors. Some theoretical properties are derived, and we illustrate 
the approach with real and simulated data, using mixture models, autoregressive models of 
unknown order, and variable selection in linear regression. 

Key words and phrases. Bayesian inference. Mixture model, Model misspecification. Relative 
entropy, Robustness, Tempering. 


1. Introduction 

In many applications, the most natural models are idealizations that are known to provide 
only an approximation to the distribution of the observed data, due to small-scale contam¬ 
inating effects that may be complicated and not completely understood. One might hope 
that any such lack of model fit, if sufficiently small, would not significantly impact inferences 
or decisions made based on the model. Often this does seem to be the case, but sometimes, 
unfortunately, the likelihood is strongly affected by perturbations to the distribution of the 
observed data, especially when the sample size is large. 

As a result, standard Bayesian procedures are not generally robust to contamination or 
misspecification. In particular, this may lead to underestimation of uncertainty, since the 
posterior inexorably concentrates at a given rate—typically at minimal Kullback-Leibler 
points—regardless of whether or not it is concentrating on something that resembles the 
observed data distribution. 

This issue was, perhaps, not as severe in the past, since on small datasets, statistical 
models automatically exhibit a certain amount of robustness to misspecification, because, in 
essence, there is insufficient power to discern small departures from the model. However, as 
datasets grow ever larger, even slight deficiencies in our models can become disruptive. 

The problem is especially apparent in the context of Bayesian model averaging and flexible 
Bayesian models, such as nonparametric models. Under misspecification, model averaging 
tends to favor more complex models as the sample size grows, even when for all practical 
purposes the data are very well-described by a simpler model. Since it is usually unreasonable 
to expect real data to come exactly from a simple parametric model, we thus find ourselves 
in the awkward situation of eventually rejecting any such model. 

I 
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Figure 1. Gaussian mixture with a prior on the number of components k, 
applied to data from a two-component skew-normal mixture. Top: Density of 
the data distribution (blue line), and a histogram of n = 100 samples. Middle 
left: The posterior on k favors larger and larger values as n increases. Bottom 
left: The coarsened posterior on k stabilizes as n increases, favoring the true 
number of components, k — 2. Middle right: Mixture density (dotted black 
line) and components (solid colors) for a typical sample from the posterior 
when n = 10^. Bottom right: Same for the coarsened posterior. See Section 7.3 
for details. 


At first, it may seem paradoxical that nonparametric models could suffer from misspec- 
ification issues, since after all, the point of a nonparametric model is that it can fit any 
distribution. It is true that if one is solely interested in fitting the data distribution—such as 
in density estimation or when the distribution is a nuisance parameter—then a nonparamet¬ 
ric model will serve nicely, in principle. However, most nonparametric models still involve 
some parametric assumptions (such as Gaussian mixture components) and when the data 
do not accord well with these assumptions, the interpretability of parameters and latent 
variables (such as component parameters and cluster assignments) breaks down. Further, 
the computational burden of nonparametric models can grow rapidly with the sample size. 

For example, suppose one is using a Gaussian mixture model with a prior on the number 
of components, but the data come from a mixture in which the components are not exactly 
Gaussian. In order to fit the data, the posterior will introduce more and more components 
as the amount of data increases, and the inferred components will not accurately reflect 
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the true components. To illustrate, suppose the data are from a two-component mixture 
of skew-normal distributions; see Figure 1. As shown in the figure, the posterior on the 
number of components favors larger and larger values as the sample size increases, and a 
typical sample from the posterior consists of many small components, obscuring the two large 
groups corresponding to the true components. Meanwhile, using the technique introduced 
in this paper, one can construct a “coarsened” posterior for the same model, under which 
the number of components does not continue to grow—see Figure 1—and as a result, one 
obtains a more macroscopic interpretation of the data as coming from two large groups. See 
Section 7.3 for details. 

Ideally, one would model all aspects of the data generating process completely correctly, 
however, this is often impractical for a number of reasons. First, it may be unrealistic 
to expect to have sufiicient insight into the data generating process to even write down 
an adequate model. Further, this can significantly increase the time and effort required to 
design the model, devise and implement reasonably efficient inference algorithms with reliable 
performance, and perhaps develop theoretical guarantees. Even after all such efforts, one may 
end up with a complex hierarchical model that is unlikely to be used in scientific applications, 
because scientists prefer statistical methods that have a clear and simple interpretation, and 
are not comfortable with drawing inferences from complicated models in a “black box” 
fashion. 

In fact, in some cases, a simple model may actually be more appropriate than a more 
complex one, even when it does not exactly fit the observed data. For instance, when the 
underlying phenomenon of interest is well-described by a simple model, but there is a lack 
of fit due to contaminating artifacts, then choosing a more complex model can be viewed as 
a form of overfitting, and may damage the resulting inferences. Meanwhile, even when the 
underlying phenomenon is not well-described by a simple model, many times the purpose of 
a model is to provide a lens through which to understand the data, rather than just fitting 
it—such as when the model is being used as a tool for exploratory analysis—in which case it 
is essential to use interpretable models so that the parameters and latent variables provide 
insight into the questions of interest. 

There have been advances in robustness to misspecification, with methods such as Gibbs 
posteriors (Jiang and Tanner, 2008), disparity-based posteriors (Hooker and Vidyashankar, 
2014), partial posteriors (Doksum and Lo, 1990), nonparametric approaches (Rodriguez and 
Walker, 2014), neighborhood methods (Liu and Lindsay, 2009), and learning rate adjustment 
(Grunwald and van Ommen, 2014); see Section 4 for a discussion of previous work. However, 
despite growing recognition of the issue and efforts toward a solution, existing methods tend 
to be either limited in scope, computationally prohibitive, or lacking a clear justification. 

These considerations lead to the following questions. Is it possible to draw coherent infer¬ 
ences from a model that may be slightly misspecified? Gan this be done in a computationally- 
tractable way? In the context of model averaging and nonparametrics, is there a principled 
way to be tolerant of models that are not exactly right, but are close enough in some sense? 

In this article, we explore a novel approach to robust Bayesian inference that may provide 
affirmative answers to these questions. Instead of using the standard posterior obtained 
by conditioning on the event that the observed data are generated by sampling from the 
model—which is clearly incorrect when the model is misspecified—the approach we consider 
is, roughly speaking, to condition on the event that the empirical distribution of the observed 
data is close to the empirical distribution of data sampled from the model, with respect to 
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some statistical distance on probability measures. We refer to this as a coarsened posterior, 
or c-posterior, for short; see Section 3 for a detailed description. 

The c-posterior approach has a number of appealing features. It has a compelling 
justification—it is valid Bayesian inference based on limited information. The interpretation 
is conceptually clear—one does inference with the same model, but conditioned on a different 
event than usual. The c-posterior inherits the continuity properties of the chosen statistical 
distance, and thus, automatically exhibits robustness to small departures from the model— 
that is, small changes to the data distribution result in small changes to the c-posterior. 
Asymptotically, the c-posterior takes a relatively simple form, facilitating computation and 
analysis. 

A particularly attractive case occurs when using neighborhoods based on relative entropy 
estimates, since then it turns out that the c-posterior can be approximated by simply raising 
the likelihood to a certain fractional power; see Section 3.1. Consequently, in this case one can 
often do approximate inference using standard algorithms, with no additional computational 
burden—in fact, the mixing time of Markov chain Monte Carlo (MCMC) samplers will 
typically be improved, since the likelihood is tempered. Further, when using exponential 
families and conjugate priors, one can even obtain analytical expressions for quantities such 
as a “robustified” marginal likelihood. 

The main disadvantage of c-posteriors is that sometimes they are less concentrated than 
one would like—for instance, if it turns out that the amount of misspecification is less than 
expected. 

An unexpected side benefit of our investigation of c-posteriors is that it reveals an inter¬ 
esting connection between Gibbs posteriors and approximate Bayesian computation (ABC), 
two areas of current research in Bayesian statistics. Roughly, a Gibbs posterior can be 
thought of as an asymptotic approximation to a particular ABG posterior; see Section 3. 

The paper is organized as follows. First, to demonstrate the basic idea, in Section 2 
we consider a c-posterior for the simplest possible toy example: Bernoulli trials. Then, in 
Section 3, we describe the c-posterior approach more generally. We discuss connections with 
previous work in Section 4, and in Section 5 we establish various theoretical properties of 
c-posteriors regarding asymptotics and robustness. In Section 6, we show how the method 
extends to time series and regression, and in Section 7, we apply the c-posterior approach to 
autoregressive models of unknown order, variable selection in linear regression, and mixture 
models with an unknown number of components. We close with a brief discussion of possible 
directions for future work. 

2. Toy example: Bernoulli trials 

For expository purposes, we first introduce the c-posterior in a toy example. Suppose 
Xi,, Xn i.i.d. ~ Bernoulh(d) represent the outcomes of n replicates of a laboratory exper¬ 
iment, and the team of experimenters is interested in testing Hq : ^ = 1/2 versus Hi ■.6^1/2. 

The standard Bayesian approach is to define a prior probability for each hypothesis, say, 
n(Ho) = n(Hi) = 1/2, and define a prior density for 9 in the case of Hi, say, 6*|Hi ~ 
Uniform(0,1). Inference then proceeds based on the posterior probabilities of the hypotheses, 
n(Ho|Ti:ji) and n(Hi|Ti:„) = 1 — n(Ho|Ti:„), where Xi-n = (ti, ... ,t„). If the observed data 
Ti,... are sampled i.i.d. from Bernoulli(0), then the posterior is guaranteed to converge 
to the correct answer, that is, n(Ho|Ti:„) l(d = 1/2) as n ^ oo. (We use l(-) to denote 
the indicator function: 1{E) = 1 if FI is true, and 1(E) = 0 otherwise.) 
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In reality, however, it is likely that the observed data do not exactly follow the assumed 
model. For instance, some of the experiments may have been conducted under slightly 
different conditions than others (such as at different times or by different researchers), or 
some of the outcomes may be corrupted due to human error in carrying out the experiment. 
Of course, in such a simple setting as Bernoulli trials, it would be easy to improve the model 
to account for issues such as these. However, for more complex models it is often not so easy, 
as discussed in the introduction, and we seek a method that works well even with complex 
models. 

Suppose it is known that any such corruption affects the distribution of the data by only a 
small amount. We can formulate this mathematically by considering Xi:n to represent some 
hypothetical “true” data which do follow the model, and supposing that the observations 
Xi:n are close to the true data in some distributional sense, but not necessarily equal to 
it. A natural way to define distributional “closeness” is in terms of the relative entropy 
D{px\\px) = '^l=oPx{'i) ^og{px{i)/px{i)) between the empirical distributions of xi-n and Xi.„, 
i.e., ^ 2 ,( 1 ) = a; in this example. 

Due to the corruption, it is inappropriate to condition on the true data Xi:n being exactly 
equal to the observed data Xi:n. Instead, if it is known that Xi:n is close to Xi^n in the sense 
that D(px\\px) < r, and nothing more is known about the nature of the corruption, then a 
natural Bayesian approach would be to condition on the event that D(px\\px) < r, that is, 
to use n(Ho I D{Px\\px) < r). In other words, rather than conditioning on the data exactly, 
condition on a relative entropy neighborhood of the empirical distribution of the data. 

In practice, one will typically only have a rough idea about the amount of corruption, and 
thus, it makes sense to put a prior on r, say, R ~ Exp(q;). This leads us to consider the 
following “coarsened” posterior, or c-posterior, for inferences about Hq and Hi: 

( 2 . 1 ) U{Eo\Dipx\\px)<R). 

In other words, we consider n(Ho|.Z^ = 1) where Z = l{D{px\\px) < R)- How should we 
choose a? In this example, we can interpret the neighborhood size r in terms of intuitive 
Euclidean notions by using the chi-squared approximation to relative entropy, D{p\\q) 
lx^(p, q) (see Prop. A.l). In particular, when X fa 1/2 we have D{psc\\px) ~ ‘2\x — Xp, and 
thus, if we expect the corruption to shift the sample mean by no more than £ or so when 
Ho : d = 1/2 is true, then it makes sense to choose a so that ER ~ 2£^. Since ER = 1/a 
this suggests using a = l/(2£^). 

In this toy example, the c-posterior in Equation 2.1 can be computed exactly (see Sec¬ 
tion A.l for details), however, in more complex cases, an approximation will be needed. In 
Section 3.1, we develop a general approximation which, when applied to this example, yields 

(2.2) n(Ho I D{p^\\px) < i?) 1/(1 + 2“”B(1 -F 1 -F «„(! - x))) 

where «„ = l/(l/n-|- l/a) and B{a,b) is the beta function (see A.l for details). Comparing 
this to the standard posterior, 

(2.3) n(Ho I Xi,n = Xi.,n) = 1/(1 + 2’"B(1 + nx, 1 + n(l - x))), 

note that the only difference is that n has been replaced by an in the c-posterior. 

To illustrate numerically, suppose we would like to be robust to perturbations affecting x by 
roughly £ = 0.02 when Hq is true. As described above, this corresponds to a = 1/(2-0.02^) = 
1250. Now, suppose that in reality Hq is indeed true, and the data are corrupted in such 
a way that xi,...,x„ behave like i.i.d. samples from Bernoulh(0.51). Figure 2 (top and 



6 


JEFFREY W. MILLER AND DAVID B. DUNSON 




Figure 2. Bernoulli trials example. Top: Results from a single sequence 
xi,X 2 ,... i.i.d. ~ Bernoulli(0.51). Middle: Average over 1000 sequences 
xi,X 2 ,... i.i.d. ~ Bernoulli(0.51). Bottom: Same as middle, but with 0.56 
instead of 0.51. In all three plots, the approximate c-posterior is indistinguish¬ 
able from the exact c-posterior. 

middle) shows the probability of Ho under the standard posterior, the exact c-posterior, and 
the approximate c-posterior (Equations 2.3, 2.1, and 2.2, respectively), for increasing values 
of the sample size n. 

When n is small, there is not enough power to distinguish between 0.5 and 0.51, so the 
standard posterior favors Hq at first (due to the Bartlett-Lindley effect), but as n increases, 
eventually the posterior probability of Hq goes to 0. (So, when n is large, the standard 
posterior is not robust to this perturbation.) Meanwhile, the c-posterior behaves the same 
way as the standard posterior when n is small, but as n increases, the c-posterior probability 
of Ho remains high, as desired—thus, the c-posterior remains robust for large n. Note, 
further, that the curve for the approximate c-posterior is directly on top of the curve for the 
exact c-posterior—the approximation is so close that the two are indistinguishable. 

What if the departure from Hq is signiffcantly larger than our chosen tolerance of e = 0.02? 
Does the c-posterior more strongly favor Hi in such cases, as it should? Indeed, it does. 
Figure 2 (bottom) shows the three posteriors on data xi,... ,Xn i.i.d. ~ Bernoulh(0.56). We 
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see that in this case, the c-posterior behaves more like the standard posterior, favoring Hi 
when n is sulRciently large. 

It is important to note that, unlike the standard posterior, the c-posterior does not concen¬ 
trate as n —7- (X). This is appropriate, since in the presence of corruption, some uncertainty 
always remains about the true distribution, no matter how much data is observed. 

3. Method 

In this section, we describe the c-posterior approach more generally. For discussion of 
connections with previous work, see Section 4; in particular, our ideas have been influenced 
by Lindsay and Liu (2009) and Wilkinson (2013). For the time being, we assume i.i.d. 
data, but the approach generalizes, for example, to time-series (Section 6.1) and regression 
(Section 6.2). 

Suppose we have a model {Pe : 0 E 0} along with a prior H on 0, and suppose there is 
a point 6i E O representing the parameters of the idealized distribution of the data. The 
interpretation here is that 6*/ is the “true” state of nature about which one is interested 
in making inferences; it may represent some actual underlying truth or may merely be a 
useful fiction. (For context, note that in many scientific endeavors, one employs idealized 
models that capture the most important features of the phenomena of interest, without 
harboring any illusions that they completely describe every detail.) Now, suppose there 
are some unobserved idealized data Xi,..., E X which are i.i.d. from however, 
the observed data xi,... ,Xn E X are actually a slightly corrupted version of Xi,... ,Xn 
in the sense that d{Pxj^,„, Pxun) < ^ some statistical distance d{-,-) and some r > 0, 
where Pxi,^ = ^ denotes the empirical distribution of Xi:n = (xi,..., x^). Suppose 

Xi,... ,Xn behave like i.i.d. samples from some Pq, and note that due to the corruption, we 
expect that Pq ^ Pe,. For intuition, consider the diagram in Figure 3. 

If there were no corruption (i.e., contamination/misspecification), then we should use the 
standard posterior—that is, we should condition on the event that Xi-n = xi-n —however, 
due to the corruption this would clearly be incorrect. Of course, if one could easily model 
the corrupting process by which xi-n is generated from Xi:„, then the most sensible approach 
would be to simply incorporate it into the model, but this may be impractical, as discussed 
at length in the introduction. 

An alternative approach is to condition on what is known—that is, to condition on the 
event that d{Pxj^,^, Pxi J < In other words, rather than the standard posterior Yi{d6 \ 
X\.,n = Xi:n), consider Yi{d0 \ d{Pxj^,„, Pxi,n) < ^)- Since usually one will not have sufficient a 
priori knowledge to choose r, it makes sense to put a prior on it, say R H, independently of 
9 and Xi:n. Generalizing further, take a sequence of functions such that Xi-n) > 0 

is some measure of the discrepancy between Ai,„ and Xi-^^. 

Definition 3.1. We refer to Yi{d6 \ dn(-Ti:„, ti:„) < P) as a c-posterior. 

It is useful to note that one can write the c-posterior as 

U{d6 I dn{Xi:n,Xi.,n) < P) OC Il{d0) F{dn{Xi.,n, Xi:n) < R \ O) 

(3.1) =U{de) [ G{dn{x[,n,xi:n))Po{dx[.J 

where G'(r) = P(P > r). The intuitive interpretation is that, to use a rough analogy, this inte¬ 
gral is like a convolution of Pq (the distribution of Xi:n) with the “kernel” G{dn{Xi:n, Ti:„)). 
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Figure 3. Notional schematic diagram of the idea behind the c-posterior. 

The ambient space is the set of probability distributions on T, and the curve 
represents the subset of distributions in the parametrized family {Pg : 9 G 0}. 

The idealized distribution Pg^ is a point in this subset, and the empirical 
distribution PXi.n of the true data converges to Pg^ as n ^ oo. Although Pxv.n 
is not observed, it is known to be within an r-neighborhood of the empirical 
distribution Px^.^ of the observed data, which, in turn, converges to Pq. The 
basic idea of the c-posterior approach is to condition on the event that Pxi.n 
is within this neighborhood. 

Some readers will recognize the form of the c-posterior in connection with approximate 
Bayesian eomputation (ABC), where it arises due to the nature of the approximation (Mar¬ 
joram et ah, 2003). Needless to say, our motivation for using it is completely different than 
in ABC; see Section 4 for discussion. 

While one could do inference for the c-posterior using algorithms similar to those used for 
ABC, this would be very slow, and would not exploit the fact that the likelihood is tractable. 
Instead, we develop approximations to the c-posterior that facilitate efficient inference. The 
crudest approach is based on the asymptotics of the c-posterior (see Section 5.1), which show 
that for large n, 

(3.2) Ti(de I A,J <R)9S G(d(P„, A,J)n(<i«) 

under mild regularity conditions; the intuition here is that when n is large, Pxi,n ~ Po with 
high probability. Thus, if d{Pg^Px^.^) can be easily computed, Equation 3.2 can be used 
to perform approximate inference for the c-posterior by, for instance, using it as a target 
distribution in Metropolis-Bastings MCMC. If i? ~ Exp(q;), so that G{r) — Equation 
3.2 becomes exp(—«d(P 0 , Px^.^))^{d6)^ which some readers will recognize as a Gibbs posterior 
(Jiang and Tanner, 2008); see Section 4. More generally, similar approximations can be made 
when using a c-posterior based on (i„(Ai,„, xi:„). 

A major disadvantage of an asymptotic approximation as in Equation 3.2, however, is that 
it is only good when n is sufficiently large; in a sense, it ignores the randomness in Ai,„. For 
the case of relative entropy, we develop a much better approximation that is also applicable 
for smaller n; see Section 3.1 below. 

There are many possible choices of statistical distance d{-,-), and the robustness properties 
of the c-posterior depend on this choice; in Section 5 . 2 , we show that as one would expect, the 
c-posterior is robust to changes in Pg that are small with respect to d(-, •). We use the term 
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statistical distance very broadly, to mean any nonnegative function for assessing discrepancy 
that is meaningful for a given application. A few potential candidates would be Kolmogorov- 
Smirnov (in the univariate setting), Wasserstein, or a maximum mean discrepancy (Gretton 
et ah, 2006). When P 0 and Po admit density functions, it is also possible to accomodate 
distances on densities, such as relative entropy, Hellinger distance, and various divergences— 
even though they may be undefined for empirical distributions—by choosing dn{Xi.n,xi-n) 
to be a consistent estimator of d{P 0 , Po). 

In the applications presented in this paper (see Section 7), we focus on relative entropy 
(and variations thereof) as our choice of d{-,-), since it works out exceptionally nicely in 
several respects. In particular, it turns out that in this case there is a trick that makes it 
unnecessary to explicitly compute d„(Ai:„, Ti:„). We discuss this next. 

3.1. Relative entropy c-posteriors. In the case of relative entropy, there is an approxi¬ 
mation to the c-posterior that improves upon Equation 3.2 in two respects: it is applicable 
for small n, and it is extremely easy to work with—in fact, it simply amounts to tempering 
the likelihood. As above, we assume i.i.d. data for the moment, and refer to Sections 6.1 
and 6.2 for generalizations. 

Suppose Po and P 0 (for all 6* G 0) have densities Po and p 0 , respectively, with respect to 
some sigma-finite measure A. Define 

d{P0,Po) = DiPo\\p0) = j Po(T)(log^)A(dT), 

and suppose dn(Wi:„, ti:„) is a consistent estimator of D{po\\p 0 ). If A ~ Exp(q;), then 
asymptotically, the c-posterior based on dn(-^i:n) Ti:„) is proportional to 

(3.3) eyip{-aD{po\\p0))Yl{dO) oc exp(Q; /po Iogp0)ll{d9) 

1 ^ 

exp E Iogp0(a;i)^n((i6') 

i=l 

n 

= u{de)l[p 0 {xi)^/^ 

i=l 

under mild regularity conditions; see Section 5 . 1 . 2 . When n is small relative to a, however, 
this is unsuitable as an approximation to the c-posterior, since in particular, iia/n> 1 then 
this makes the likelihood more concentrated, rather than less. 

Instead, we propose a better alternative, based on a central limit theorem approximation. 
The derivation is well-founded when the sample space X has finitely-many elements, and 
the extension to general A, while heuristic, is intuitively sensible. When \X\ < 00 , a natural 
choice of dn(-Ei:n, Ti:„) is simply D{px^.^\\pxi,n)) fhat is, the relative entropy of the empirical 
densities. Assume R ^ Exp(q;). Then by Equation 3.1 and an approximation detailed in 
Section A. 2 , 

n(d 6 ' I dn(^i:n,Ti:n) < R) qc E{ex.p{-aD{p^^J\px,.,„)) \ 0)ll{de) 

^ (nCn/«)^ exp{-nCnD{p^^J\p 0 ))Il{de) 
oc exp (CnXr=i ^Ogp0{xi))ll{d9) 
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where 

(3.4) 


^ 1 /n 

1/n + l/a 


1 

1 ~j~ TijOL 


The rough idea is that Pxi:„ is approximately multivariate normal with mean 'Pq and precision 
of order n, and the exponentiated relative entropy approximates a normal density with mean 
Pxi:„ and precision of order a, so the expectation above behaves like the convolution of two 
normals, and the resulting precision is of order l/(l/n + l/Q;) = n^n- As demonstrated in 
Figure 2, this approximation can be quite accurate even when n is small, although more 
experimentation is needed to assess its accuracy in more general situations. Note that 
Cn ~ Q;/n when n ^ a, and Cn ~ 1 when a ^ n; thus, we refer to this as the small-sample 
correction to the asymptotic approximation. 

This leads to the following useful approximation to the relative entropy c-posterior: 


(3.5) Il(d9 \ dniXi,n,Xi:n) < R) gS 11(^6') JJ (Xj)^”. 

i=l 


Note, in particular, that this enables one to approximate the c-posterior without explicitly 
computing the relative entropy estimates dn{Xi:n,Xi:n), which would normally involve com¬ 
puting a density estimate of Po in order to handle the JPologPo term in D{po\\p 0 ). Since 
this term is constant with respect to d, it is absorbed into the constant of proportionality, 
allowing one to bypass this density estimation step, which would be both computationally 
expensive and statistically inefficient, especially in high dimensions. 


Definition 3.2. Given ( G [0,1], we refer to as a power likelihood, and to the 

distribution proportional to Il{d 6 ) YYi=iPe{xiY as a power posterior. 

There are a number of other methods in which a power likelihood is employed (see Sec¬ 
tion 4), however, to our knowledge, the form of power we use (i.e., C = Cn), and its theoretical 
justification, are novel. 

A useful interpretation of our power posterior is that it corresponds to adjusting the sample 
size from n to ra/n- Thus, roughly speaking, by choosing a particular value of a, one makes 
the power posterior tolerant of all 6*’s for which a sample of size a from Po could plausibly 
have come from Pg. This idea is closely related to the model credibility index of Lindsay 
and Liu (2009). This suggests the interesting possibility of approximating the c-posterior by 
taking random subsets of size a, and combining the resulting posteriors in some way; this 
might have certain advantages in terms of computation and implementation, however, we do 
not explore it in this article. 

Due to its simple form, inference using the power posterior is often easy, or at least, no 
harder than inference using the ordinary posterior. We discuss three commonly-occuring 
cases: analytical solution in the case of exponential families with conjugate priors, Gibbs 
sampling in the case of conditionally-conjugate priors, and Metropolis-Hastings MCMC more 
generally. 


3.1.1. Power posterior with conjugate priors. When using exponential families with conju¬ 
gate priors, one can often obtain analytical expressions for integrals with respect to the power 
posterior. Suppose P 0 {x) = exp (6*^s(x) — n{0)), where s{x) = (si(x),..., Sk{x)y are the suf¬ 
ficient statistics, and suppose Il{d 6 ) = 7 r^^i,{ 9 )d 6 where t^^,v{9) = exp (6*^^ -vK{9)-y{i,v)), 
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noting that this defines a conjugate family. Then the power posterior is proportional to 

n 

(3.6) WpeAiA oc exp -A + nCn)«;(^)) oc 

1=1 

where = ^ + Cn Yi and thus, the power posterior remains in the 

conjugate family. 

For most conjugate families used in practice, simple analytical expressions are available for 
the log-normalization constant as well as for many integrals with respect to 7r^^^(9). 

This enables one to obtain analytical expressions for many quantities of inferential interest 
under the power posterior, thus providing approximations to the corresponding quantities 
under the relative entropy c-posterior. For instance, one obtains a marginal power likelihood, 

n 

\Yve{xiAde = exp 
1=1 

which can be used to compute robustified Bayes factors or a robust posterior on models, in 
the context of model inference. This is robust to small perturbations to Po (in the sense 
of relative entropy), whereas the usual model inference/selection procedures can be very 
sensitive to such perturbations, for large n; see Section 5.2 for details. 

In Section 2, we used this approach in the toy example involving Bernoulli trials. In 
Section 7.1, we apply it to perform robust inference for the order of an autoregressive model. 

3.1.2. MCMC on the power posterior. Often, it is desirable to place conditionally-conjugate 
priors on the parameters of an exponential family—for instance, placing independent normal 
and inverse-Wishart priors on the mean and covariance of a normal distribution. In such 
cases, one can use Gibbs sampling on the power posterior, because for each parameter given 
the others, we are back in the case of a conjugate prior, and thus—as shown by Equation 
3.6—the full conditionals belong to the conjugate family, making them easy to sample from. 
In Section 7.2, we use Gibbs sampling for variable selection in linear regression with the 
power posterior. 

More generally, samples can be drawn from the power posterior by using Metropolis- 
Bastings MCMC, with the power likelihood in place of the usual likelihood. In Section 7.3, we 
use Metropolis-Bastings for inference in mixtures with a prior on the number of components, 
with the power posterior. 

By a stroke of luck, the mixing time for MCMC with the power posterior will often be 
better than with the standard posterior, since raising the likelihood to a fractional power 
(i.e., between 0 and 1) has the effect of flattening it, enabling the sampler to more easily 
move through the space, particularly when there are multiple modes and n is large. Indeed, 
raising the likelihood to a fractional power—also known as tempering—is sometimes done 
in more complex MCMC schemes in order to improve mixing time, a well-known example 
being MC^ (Geyer, 1991). 

Thus, generally speaking, it is a straightforward matter to use MCMC for sampling from 
the power posterior. Bowever, there is a subtle point that should be carefully noted. Often, 
latent variables are introduced into an MCMC scheme in order to facilitate moves or to im¬ 
prove mixing, and sometimes, such latent variables do not work in the same way for the power 
posterior. For example, in a mixture model, say, Yi=i'^ifAAy latent variables Zi,... ,Zn 
indicating which component each datapoint comes from are often introduced, so that the full 
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conditional distributions for Gibbs sampling from w, (/?, and 2 take nice and simple forms. 
However, when using the power posterior, the likelihood is 11^=1 ( 

it seems that introducing zi,... ,Zn no longer leads to nice full conditionals. On the other 
hand, it may be possible to use a different set of latent variables; see Antoniano-Villalobos 
and Walker (2013) for the case of mixtures. 


4. Connections with previous work 

The c-posterior is mathematically equivalent to the type of posterior approximation result¬ 
ing from approximate Bayesian computation (ABC) (Tavare et ah, 1997; Marjoram et ah, 
2003; Beaumont et ah, 2002; Wilkinson, 2013)—indeed, ABC provided part of our inspira¬ 
tion for considering this form of posterior. However, there are some crucial distinctions to 
note. First, the motivation here is completely different than with ABC: we are concerned 
with robustness to misspecification, while ABC is concerned with inference in models with 
intractable likelihoods. Generally speaking, we assume the likelihood is easily computed, 
which makes inference much more computationally efficient. Another major difference is 
that in ABC, the coarsened posterior is viewed as an undesirable side effect of the approx¬ 
imate nature of ABC, while from our perspective, it is precisely the object of interest—in 
other words, for us it is an asset, not a liability. 

The c-posterior can also be viewed as conditioning on partial information, a technique 
which is often used to improve robustness (Doksum and Lo, 1990; Pettitt, 1983; Hoff, 2007; 
Dunson and Taylor, 2005); also see Cox (1975). Usually, however, this is done by conditioning 
on some insufRcient statistic; for example, Doksum and Lo (1990) perform robust Bayesian 
inference for a location parameter by conditioning only on the sample median, rather than 
the whole sample. Our approach of conditioning on a distributional neighborhood is quite 
different. 

Gibbs posteriors have recently been introduced as a very general framework for updating 
prior beliefs using a generalized “likelihood” (Jiang and Tanner, 2008; Zhang, 2006b; Li et ah, 
2014; Bissiri et ah, 2013). Under certain conditions, for n sufficiently large, the c-posterior 
is approximately proportional to exp{—ad{P 0 , , which can be viewed as a Gibbs 

posterior with “risk” d{P 0 , Pxun)- research involving Gibbs posteriors, an issue of current 
interest is how to choose a so that the concentration of the posterior is appropriately cali¬ 
brated. The fact that Gibbs posteriors can be interpreted as an approximation to a coherent 
Bayesian procedure (the c-posterior) may provide insight into this calibration problem. 

A number of researchers have employed a form of power likelihood obtained by raising the 
likelihood to a power between 0 and 1. Usually, this is done for reasons completely unrelated 
to robustness, such as marginal likelihood approximation (Friel and Pettitt, 2008), improved 
MCMC mixing (Geyer, 1991), consistency in nonparametric models (Walker and Hjort, 
2001; Zhang, 2006a), discounting historical data (Ibrahim and Chen, 2000), or objective 
Bayesian model selection (O’Hagan, 1995). However, recently, the robustness properties of 
power likelihoods have started to be noticed: Griinwald and van Ommen (2014) provide 
an in-depth study of a simulation example in which a power posterior exhibits improved 
robustness to misspecification, and they propose a method for choosing the power; also see 
Griinwald (2011, 2012). Nonetheless, in all such previous research, a fixed power is used, 
rather than one tending to 0 as n —?• 00 . It seems that neither the form of power likelihood 
we use, nor the theoretical motivation for it, have appeared in any prior work. 
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Conceptually speaking, the existing methods that are perhaps most similar to the idea of 
the c-posterior are goodness-of-fit tests that assess whether the data distribution is within 
a neighborhood of the model space (Rudas et ah, 1994; Goutis and Robert, 1998; Dette 
and Munk, 2003; Liu and Lindsay, 2009), however, the methods used previously are very 
different from ours. Closely related to such work is the model credibility index of Lindsay 
and Liu (2009), a concept which has heavily influenced our thinking in the development of 
the c-posterior. 


5. Theory 

In this section, we establish the asymptotic form of c-posteriors (Section 5.1) and their 
robustness properties (Section 5.2). Let X and 0 be standard Borel spaces, and let Ai 
denote the space of probability measures on X, equipped with the weak topology. Let 
{P$ : d G 0} C A4 be a family of probability measures on X such that 6 i-Y Pei^A) is 
measurable for all measurable subsets A Q X. Let 11 be a prior measure on 0, and consider 
the following model: 

0 ~ n, 

Xi,..., i.i.d. ~ Pfl, and 
R H, independently of 9, 

where iL is a distribution on [0, oo). Note that we use (bold) 9 for the random variable, and 
9 for particular values. Define 

G{r)=F{R > r). 

Now, suppose the observed data Xi,... ,Xn E X behave like i.i.d. samples from some Po E Ai. 
Let d : A4 X A4 ^ [0, oo], and for n E (1,2,..let d„ : X^ x T"' ^ [0, ooj. It is assumed 
that 6 I-7- d{PQ, P) is measurable for all P G A4, and d„(-, •) is measurable for each n. 

5.1. Asymptotic form of the c-posterior. The c-posterior takes a simple form as n —>■ oo, 
under mild regularity conditions. We prove a general convergence theorem for c-posteriors 
(Theorem 5.3) and then apply it to c-posteriors derived from relative entropy (Corollary 5.4) 
and weakly-continuous distances (Corollary 5.5). 

5.1.1. Convergence theorem. The following basic lemma captures the underlying principle 
at work in establishing both the asymptotic form of the c-posterior (Theorem 5.3) as well as 
its robustness (Theorem 5.6). 

Lemma 5.1. If U, Un, V, IT G M U {oo} are random variables such that Un > U, F{U = 

n^oo 

V) = 0, P(P < T) > 0, and E|IT| < oo, then 

E(IT \Un<V) -^ E(IT \U <V). 

n—)-oo 

All proofs for this section have been placed in Section A.3. The following condition is 
necessary to avoid certain pathologies; it is always satisfied, for instance, when d(P( 9 , Pq) < oo 
with positive probability and R has a density with respect to Lebesgue measure that is 
positive on [0, oo). 

Condition 5.2. Assume P(d(Pe, Po) = P) = 0 and P(d(Pe, Po) < R) > 0. 

We use to denote convergence with respect to the weak topology. 
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Theorem 5.3. If dn{Xi:n^ xi:n) d{Pe^Po) as n ^ oo and Condition 5.2 is satisfied, then 

(5.1) Il{d9 I dn{Xi:n,xi:n) < R) ^{dd \ d{P 0 ,Po) < P) QC G{d{Po, Po))ll{d 6 ), 

and in fact, 

(5.2) E(fe(e) I djx,.,„,x,,„) <R)^ E('>(e) I d(Pe,P,) < R) = 

for any h G i.e., any measurable /i : 0 —)• M such that f \h{0)\ll{d0) < oo. 

As noted earlier, a case of particular interest arises when R ~ Exp(«), since then G{r) = 
6““*' and the resulting asymptotic c-posterior is proportional to exp(—ad{PQ, Po))Il{d9), by 
Theorem 5.3. This is asymptotically equivalent to exp(—ad{PQ, Pxj^.^))Il{d9), provided that 
d{P 0 , Pxi,n) d{P 0 , Po), and as discussed in Section 4, this is precisely the form of a Gibbs 
posterior; thus, a Gibbs posterior can be interpreted as a large-sample approximation to 
a c-posterior. It is also worth noting that if i? = ro with probability 1 for some ro > 0, 
then G{r) = l(r < ro), and by Theorem 5.3 the asymptotic c-posterior is proportional to 
'l{d{P 0 ,Po) < ro)Il{d9), i.e., it is zero outside the ro neighborhood of Po and reverts to the 
prior inside. 

5.1.2. Application to relative entropy. Suppose Po has density Po and Pq has density p 0 for 
each d G 0. 

Corollary 5.4. Suppose dn{Xi:n,Xi:n) is an almost-surely consistent estimator of D{po\\p 0 ), 
i.e., dn{X\,n,Xi.,n) D{po\\p 0 ). If d{P 0 , Pg) = D{po\\p 0 ) and Condition 5.2 is satisfied, 
then Equations 5.1 and 5.2 hold. 

This establishes the asymptotic form of the relative entropy c-posterior as claimed in 
Equation 3.3. 

5.1.3. Application to weakly-continuous distances. Recall that Ai;„ = ^ ELi V denotes 
the empirical distribution of Xi:„. 

Corollary 5.5. Suppose d : A4 x A4 —?• [0, oo] has the property that d{Pn,Qn) —t d{P,Q) 
whenever P^ ^ P and Qn ^ Q. If Condition 5.2 is satisfied, then Equations 5.1 and 5.2 
hold when dY^i^Xi.^i, x^.jf) di^Pxi-ni Pxi-,„')- 

5.2. Robustness properties. Here, we show that when n is large, the standard posterior 
can be strongly affected by small changes to the observed data distribution Po, particularly 
when performing model inference (Section 5.2.1), while c-posteriors are robust to small 
changes in Po (Section 5.2.2). To see roughly why the standard posterior is not robust, note 
that 

n 

I\.{d9 I Xi-n = Xi.,n) oc exp (E Iogp 0 {xfill{d 9 ) ^ exp (njpologp 0 )ll{d 9 ) 

i=l 

oc exp{-nD{po\\p 0 ))Il{d 9 ), 

assuming the densities po and Po exist. Due to the n in the exponent, even a slight change to 
Po can dramatically change the posterior. On the other hand, by comparison, the relative en¬ 
tropy c-posterior with R ~ Exp(«) is asymptotically proportional to exp(—aD(po\\p 0 ))Il{d 9 ), 
suggesting that the c-posterior should remain stable in the limit as n —?• oo; below we make 
this precise, for a general choice of d{-, •). 
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5 . 2 . 1 . Lack of robustness of model inference under the standard posterior. Suppose that for 
each k in some countable index set, we have a model A4k = {Pe ■ 0 G 0fc}, where 0^, is a 
tfc-dimensional Euclidean space. Let TT{k) be a prior on the model index k, and for each k, 
let TTfc be a probability density with respect to Lebesgue measure on 0^,; this induces a prior 
n on the disjoint union 0 = Ufc 

It is well-known that, under mild regularity conditions, the marginal likelihood p{xi:n\k) = 
fok p(Ti:n| 6 *) 7 rfc( 6 *)(i 6 * has the asymptotic representation 

|,x p(Xi:nld^)7rk(dl) /27r\W2 

ldetH( 0 *;p,)lPnnJ ’ 

as n oo, where O'f = argmax 0 g 0 ^ is the maximum likelihood estimator for model 

k, 61 = argmin^^Q^ D(po\\p 0 ) is the minimal Kullback-Leibler (KL) point within model k, 
and H{ 6 ]Po) = - J Poi'^e'^ogpe). Here, a„ ~ means 1. Letting fn{k) = 

— ^logp(xi:n\ 6 ^), this implies that 

(5.3) pixi:n\k) ~ Cke~'^P^'^^n~^’‘P 

for a constant Ck not depending on n or xi:„. Typically, fn{k) f{k) := D{po\\p 0 *) — 
f PologPo- Note that f(k') < f{k) if and only if model k' is closer to Po than model k 
in terms of minimal KL divergence; also, note that the marginal likelihood automatically 
penalizes more complex models via the factor. 

Given such an asymptotic representation, it is easy to see that for any k, if there exists 
k' such that f{k') < f{k), then 7r{k\xi:n) —>■ 0 as n —>■ oo. Consequently, even the slightest 
change to Po can result in major shifts in the posterior on k, when n is large. For instance, 
it often happens that the models are nested, e.g., A4i C J \42 C • • • and ti < t 2 < ■ ■ ■ ■ 
This is the case, for example, when Mk consists of fc-component mixtures, or fcth-order 
autoregressive models; variable selection is slightly more complicated but ultimately similar. 
If the collection of models is correctly specified with respect to Po, then there is some minimal 
k' such that D{po\\p 0 *^,) = 0, and thus TT{k\xi.n) —?• 0 for all k < k' (and typically, the posterior 
on k will concentrate at this k'). However, even the slightest perturbation to Po will usually 
result in either (a) an increase in this minimal k', or (b) a situation where inf^ D(po\\p 0 *) is 
not attained at any k, causing the posterior on k to diverge, in the sense that 7r(A:|xi:„) —>■ 0 
for all k. Hence, model inference with the standard posterior is not robust. 

5.2.2. Robustness of the e-posterior. The definition of robustness, roughly speaking, is that 
small changes to the distribution of the data result in small changes to the resulting infer¬ 
ences. This can be formalized by requiring that asymptotically, the outcome of an inference 
procedure be continuous as a function of Po, with respect to some topology (the weak topol¬ 
ogy being a standard choice) (Huber, 2004). From this perspective, the lack of robustness 
of the standard posterior can be thought of as a lack of continuity with respect to Po, 
asymptotically. 

We show in the following theorem that the asymptotic c-posterior inherits the continuity 
properties of whatever distance d(-,-) is used to define it. Consequently, the c-posterior 
will be robust to perturbations to Po, provided that d(-, •) is chosen appropriately. In the 
terminology of Section 3, if the observed data distribution Po is close to the ideal data 
distribution Pen then the c-posterior will be close to what it would be if Pq — Poj- 
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To interpret the theorem, recall that on any metric space, a function f{x) is continuous if 
and only ii Xm ^ x implies f{xm) fix). Thus, to show continuity as a function of Po (in 
some topology), one must show that if Pm —t Po, then the resulting sequence of asymptotic 
c-posteriors converges as well. In fact, if d(-, •) is continuous (in this same topology), then 
the theorem shows a bit more than that, since then Pm —t Po implies d{PQ, Pm) diPg, Po). 

Theorem 5.6. If Pi, P 2 ,.-- G -M. such that diPo,Pm) - > d{P 0 ,Po) for H-almost all 

m—)-oo 

9 eQ, and Condition 5.2 is satisfied, then for any h G L^(n), 

E(/i(6») I diPe,Pm) <R)^ E(h(6») | diPe,Po) < R) 

as m ^ oo, and in particular, HidO \ diPg, Pm) < R) n((i 6 * | diPg, Po) < R). 

In particular. Theorem 5.6 implies that the c-posterior is robust in the context of model 
inference, since taking h{9) = l(d G 0^), we have 

n(0fc I diPs,Pm) < R) —t n(0fc I diPg,Po) < R) 

as m ^ oo, under the assumptions of the theorem. 

6 . Extensions 

6.1. Time-series c-posterior based on relative entropy rate. Suppose the sequence 
of observed data (xi,... ,Xn) is a partial sample from a stationary and ergodic process 
with distribution Po, and suppose the model : 6 * G 0} consists of stationary hnite- 
order Markov processes. Assume that for some sigma-hnite measure p on A, for all n G 
{1, 2,...} and all 6 * G 0, the hnite-dimensional distributions have densities Poixi,..., x„) and 
Peixi,... ,Xn) with respect to the product measure pp, and assume Ep^| logpo(^i:n)| < oo 
and EpJ logp 0 (Xi:„)| < oo. 

A natural way of assessing the discrepancy between the processes Pg and Pq is by the 
relative entropy rate (Gray, 1990), 

ViPoWPe) = lim -Dipoixi.m)\\peixi:n)). 

n^oo Ti 

Suppose dniXi.n,xi.m) is an a.s.-consistent estimator of ViPoWPe) when (Xi,X 2 ,...) ~ Pe 
and {xi,X 2 ,...) ~ Po, and consider the c-posterior n((i 6 * | dniXi.n,xi.m) < R), with R ~ 
Exp(q;). Then by Lemma 5.1, the asymptotic c-posterior is 

Hide I ViPoWPe) <R)oc expi-aViPoWPe))Hide). 

If Pe is fcth-order Markov, then 

ViPoWPe) = -RiPo) - Ep„ \ogpeiXk+i\Xi, ..., A^) 

where RiPo) is the entropy rate of Pg, which we assume is hnite (Gray, 1990, Lemma 2.4.3). 
Further, when (xi, X 2 , • • •) ~ Ro, 

1 ” 

- logpeixi\xi,..., Xi_i) -^ Ep„ logpeiXk+i\Xi, ...,Xk) 

Tl ^^ n^oo 
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with probability 1 , by the ergodic theorem (Breiman, 1968, 6.28). Combining this with the 
small-sample correction (applied heuristically in this setting) suggests the approximation 


Il{de I 4 (Xi.„, xi.,n) < i?) oc exp - nCn - ■H(Po) “ “ XI logP 0 (Ti|a;i,..., Xi-i 


1=1 


n 

oc n(d6') 'Wpe{xi\xi ,... ,Ti_i)^”. 

i=l 


Thus, as in the i.i.d. case, the end result is an approximation obtained by simply raising the 
likelihood to the power In Section 7.1, we apply this to perform robust inference for the 
order of an autoregressive model. 


6.2. Regression c-posterior based on conditional relative entropy. In regression, 
one observes covariates/predictors ti, ... associated with target values yi,... ,yn, and 
models the conditional distribution of y given x. As in the i.i.d. setting, in order to allow for 
contamination/misspecification, let us suppose that Yi\xi is drawn from the model P 0 {y\x) 
for f = 1 ,... ,n, and the observed values yi:n are a slightly corrupted version of Yi:n, in 
the sense that dn(Yi:n,yi:n\xi:n) < R for some measure of discrepancy d„(-,-|-). Suppose 
(ti, yi),..., {xn, Pn) behave like i.i.d. samples from some Po(t, y). For notational clarity, let 
us assume that these densities on x and y are with respect to measures that we will denote 
by dx and dy, respectively. 

A natural choice of discrepancy between the conditional distributions Po{y\x) and P 0 {y\x) 
is the conditional relative entropy, 

D 0 \= f Po{x,y) log ^°^^|^^ dxdy, 

J Pe[y\x) 

and in turn, an a.s.-consistent estimator of this quantity is a sensible choice for d„(-,-|-). 
Then, by Lemma 5.1, the resulting c-posterior converges to a nice asymptotic form: 

n(dd I dn{Yx.,n,y\:n\xi,n) < R) H {d 0 \ D 0 < R) (X eyip{-aD 0 )Tl{d 6 ) 

(X exp 

if we take R ~ Exp(q;) as usual. To obtain an approximation that is applicable for smaller n 
as well, we apply the same small-sample correction as before, replacing a by nC,n- Combining 
this with an empirical approximation to the integral suggests using 


a 


jPo{x,y) logp0{y\x) dx dy^Il{de) 


n(dd I dn{Yi:n,yi:n\Xi:n) < R) 9 S ^Xp (CnE* (dl))n(dd) 


= Il{de)Y[pe{yi\xi)‘^". 

i=l 


Consequently, once again, we arrive at a power posterior approximation to the c-posterior, 
allowing us to bypass the computation of d„(-, -I-). In Section 7 . 2 , we apply this to perform 
robust variable selection in linear regression. 


7. Applications 

7.1. Autoregressive models of unknown order. We illustrate using the c-posterior to 
perform inference for the order of an autoregressive model in a way that is robust, not only 
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to the form of the distribution of noise/shocks, but also to misspecification of the structure 
of the model, such as time-varying noise. This serves as a nice demonstration of how the 
robustified marginal likelihood can be computed in closed form when using conjugate priors, 
and provides some insight into why coarsening works. 

Consider an AR(fc) model, that is, a fcth-order autoregressive model: 


k 


i=i 


for t = 1,..., n, where £i,..., £„ i.i.d. ~ ^^(0, cr^) and Xj = 0 for t < 0 by convention. Let 
nik) be a prior on the order k, let 9i,..., 6k\k i.i.d. ~ A/"(0, ctq), and for simplicity, assume 
cr^ is known. 

We apply the time-series c-posterior developed in Section 6.1 to obtain robustness to 
perturbations that are small in the sense of relative entropy rate. Since I3\k has been given a 
conjugate prior, then as described in Section 3.1.1, we can analytically compute the resulting 
marginal power likelihood, 


Pc{xi:n\k) := / p{xi:n\0,k)‘^'^Tr{9\k)d6 


yjnw 


J\f{9 I 0,allkxk)d9 


.|A|l/2 


-N{Xi.,n I 0, a^Inxnf 


where A = CnAf + crg Mij = Xt-iXt-jjcP', and Uj = XtXt-ijcP', by straightfor¬ 


ward calculation. This, in turn, can be used to compute a robustified posterior on the model 
order k, defined as TTcik\xi:n) oc Pc{xi:n\k)Tr{k) . This is expected to be robust to departures 
from the AR(A:) model that require more than a samples to distinguish, and thus, it will 
favor values of k that are consistent with the data to within this specified tolerance. 

To demonstrate empirically, we generate data from a process that is close to AR(4) but 
exhibits time-varying noise that cannot be captured by the model: 


(7.1) 


Xt — ^ ^ d^Xf-i -\- et-\- \ sin t 
r=i 


where 9 = (1/4,1/4,—1/4,1/4), St i.i.d. ~ A/'(0,1), and Xt = Q iov t < 0. We apply the 
model above to such data, and compare the standard Bayesian approach to the coarsened 
approach. 

For the model parameters, we set = 1 to match the true value, and take Uq = 1. If 
one expects a particular amount of misspecification, this can be used to make a principled 
choice of a; see Section 7.2. Here, we assess sensitivity to the choice of a, by considering 
the c-posterior on A: as a varies, when n = 10^, with an improper uniform prior on k; see 
Figure 4 (upper right). There is a fairly wide range of values that give similar results—for 
a’s between 100 and 1200, the large majority of the mass is on the correct value of k, namely 
k = A. 

To visualize what happens as n increases, we set a = 500, and consider the log of the 
marginal likelihood. Due to the misspecification, the standard posterior strongly favors 
values of k much greater than 4 when n gets sufficiently large; see Figure 4 (lower left). 
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Figure 4. Autoregression example. Upper left: Data sampled from the 
process in Equation 7.1. Upper right: Sensitivity analysis, displaying the 
c-posterior on k as a varies, when n — 10^. Lower left: Log marginal likeli¬ 
hood of AR(A:) model for /c = 0,1,..., 20, on increasing amounts of data from 
this process. Lower right: Log of coarsened marginal likelihood for the same 
model, on the same data. 


Meanwhile, the c-posterior stabilizes to a distribution on k favoring fc = 4; see Figure 4 
(lower right). For values of n less than a, the standard and coarsened approaches yield 
similar results, however, as n increases beyond a, they differ markedly. 

More generally, this type of picture is typical for the log marginal likelihood when compar¬ 
ing models of increasing complexity. As discussed in Section 5.2, the log marginal like lih ood 
automatically penalizes more complex models, via the term — log R where tk is the dimen¬ 
sion of the parameter space, e.g., tk — k for the AR(/c) model above; this penalty is visible 
in the linear decline exhibited in the n = 100 plot. As n increases, this complexity penalty 
only increases proportionally to log n, and thus it becomes overwhelmed by the main term of 
order n, involving the log-likelihood at the maximum likelihood estimator within model k. 
When n is sufficiently large, the following pattern emerges, as seen in the n = 10000 plot (for 
the standard approach): for model complexity values k that are too small, there is a clear 
lack of fit, and as k increases the log marginal likelihood increases rapidly until the model 
can fairly closely approximate the data distribution, at which point it plateaus, continuing 
to increase only slightly after that as only fine grain improvements can be made. 
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From this perspective, the reason why the coarsened marginal likelihood “works” is that 
when n is large, it maintains a balance between the model complexity penalty and the main 
log-likelihood term, by behaving as though the sample size is no larger than a. 


7.2. Variable selection in linear regression. Consider the following spike-and-slab model 
for variable selection: 


W ~ Beta(r, s) 

Pj ~ A/'(0, l/Fo) with probability W, otherwise Pj = 0, for each j = 1,... ,p 
A ~ Gamma(a, b) 

Yi\P,X rsj N'{P'^Xi, 1/A) independently for z = 1,..., n. 


Models of this type are often used to infer which covariates xn,..., xip are predictive of 
the target variable yi, by considering which coefficients Pj have a high posterior probabil¬ 
ity of being nonzero. This provides valuable insight into the relationships present in the 
data generating process. However, usually, it is unlikely that the data exactly follow the 
J\f{P'^Xi,!/X) form, and although the model exhibits some robustness to departures from 
normality, it is not robust to departures from the linearity assumed in the mean function 
P'^Xi = PiXii -|- • • • -|- PpXip. For instance, if the mean is actually Pig{xii) where g is close 
to but not exactly linear, and xn is correlated with other covariates, then the posterior will 
typically make additional coefficients nonzero in order to compensate. 

We demonstrate how the c-posterior provides robustness to misspecification of this type. 
This example also provides an opportunity to show how Gibbs sampling can be used with 
power posteriors when conditionally-conjugate priors have been chosen. 

As described in Section 6.2, in the regression setting, the c-posterior based on conditional 
relative entropy can be approximated by the power posterior obtained by raising the likeli¬ 
hood to Cn = 1/(1 + n/a), as before. If we first integrate W out of the model, the resulting 
power posterior is 

ttPP, A|y) oc 7r{p, X)p{y\P, A)^". 

Due to the use of conditionally-conjugate priors, the full conditionals for Pj and A can 
be derived in closed form, by standard calculations. We give the formulas here without 
justification: 


7rc{X\P,y) = Gamma (^A 


a + lnCn, b+ |CnEr=i(?^i 



and one can sample from TrdPjlP-j, X,y), where /3_j = {Pi : £ d /); by setting Pj = 0 with 
probability 

= 01 A.») = (i + yiVt exp j 

where L = Lq + ACnELi^P’ ^ = {Kn/L)YTi=ihxij, and 5i = y^ - and 

otherwise sampling Pj from L~p. 


7.2.1. Simulation example. To demonstrate empirically, first consider a simulated example 
where the mean of the observed data is a slightly nonlinear function of a single covariate, 
plus a constant offset: 

(7.2) 


Vi — Poi + Po2{Xi2 + ’^^{ 2 ) + G 
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where /3oi = —1, P 02 = 4, and £ 1 ,... i.i.d. ~ A/"(0,1). Following standard practice, sup¬ 
pose Xii = 1, to accomodate a constant offset. Suppose there are five covariates Xi 2 ,, XiQ 
distributed according to a multivariate skew-normal distribution (Azzalini and Capitanio, 
1999) which has been centered and scaled so that each covariate has zero mean and unit 
variance: Xij = {Xij — KXij)/a{Xij) for j = 2,... ,6, where Xi ~ with shape 

a = (0.6,2.7, —3.3, —4.9, —2.5) and scale matrix 


/ 1.0 

-0.89 

0.93 

-0.91 

0.98 \ 

-0.89 

1.0 

-0.94 

0.97 

-0.91 

0.93 

-0.94 

1.0 

-0.96 

0.97 

-0.91 

0.97 

-0.96 

1.0 

-0.93 

0.98 

-0.91 

0.97 

-0.93 

1.0 ; 


The a and Q above were randomly-generated; there is nothing particularly special about 
them, except that fl was chosen so that the covariates would be fairly strongly correlated. 
Figure 5 (top) shows a scatterplot of yi versus Xa for 200 samples, as well as the mean as a 
function of x^. 

For the model parameters, we choose r = 1 and s = 2p (in order to favor having 0(1) 
nonzero coefficients, regardless of p), Lq = 1, and a = b = 1. To choose the coarsening 
parameter a, recall that the c-posterior is obtained by conditioning on the (estimated) con¬ 
ditional relative entropy being less than R, where R ~ Exp(q;). The relative entropy between 
two Gaussians and is ^(hi — h 2 )^- Thus, if we expect the misspecifi- 

cation/contamination to shift the mean function by approximately ±(J on average, and the 
noise has standard deviation a, then it is reasonable to choose a so that Ei? j ( 20 "^), i.e., 
OL ~ . In the present situation, by cheating and using our knowledge of the truth, we 

choose 6 = 0.2 and cr = 1, leading to a = 50. 

For each n E {100,1000,5000,10000,50000}, ten datasets were generated, and for both 
the standard posterior and the coarsened posterior, 50000 Gibbs sweeps were performed on 
each dataset, the first 5000 of which were discarded as burn-in. 

Figure 5 (middle) shows the average of these posteriors on k over the 10 datasets, for the 
standard and coarsened posteriors. Note that the “true” number of nonzero coefficients in 
Equation 7.2 is fc = 2 (/^oi and I 3 q 2 ). 

Figure 5 (bottom) shows the posterior cumulative distribution function (c.d.f.) and 95% 
credible interval for each coefficient f3i,..., Pq when n = 10000, for the standard and coars¬ 
ened posteriors. Recall that the “true” values are /3i = — 1, /52 = 4, and P 3 = • • ■ = Pq = 0. 
The 95% intervals for the standard posterior are quite far from the true values of Pi, Pi, and 
P^, while all of the 95% intervals for the c-posterior contain the true values; also note that 
for P 3 ,... ,Pq, most of the c-posterior probability is at zero. The case of Pi, in particular, 
illustrates that in addition to incorrectly inferring which coefficients are nonzero, under mis- 
specihcation, the standard posterior can also lead to incorrect inferences about the values of 
the nonzero coefficients. The c-posterior mitigates this by more appropriately calibrating the 
amount of concentration; however, the price to be paid is that this can cause the c-posterior 
to be considerably more diffuse than necessary, such as in the case of P 2 . 

7.2.2. Modeling birthweight of infants. The Gollaborative Perinatal Project (GPP) collected 
data from a large sample of mothers and their children, measuring many medical and socioe¬ 
conomic variables from before and during pregnancy, as well as early childhood (Klebanoff, 
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Figure 5. Variable selection with simulated data. Top: Scatterplot of the 
target variable yi versus Xj 2 , as well as the mean function (black line). Middle 
left: The posterior on the number of nonzero coelRcients k favors larger values 
as n increases. Middle right: The c-posterior favors the “true” number, k = 2, 
even as n grows. Bottom left: Posterior c.d.f. for each coefficient (blue), and 
95% credible interval (red). Bottom right: Same, for the c-posterior. 


2009). Using a subset of the CPP data, we illustrate how the c-posterior can be used to 
analyze the relationship between birthweight and a number of predictor variables. 

The dataset we use contains n = 2379 subjects, and 71 covariates that are potentially 
predictive of birthweight. The data are preprocessed to normalize each covariate as well as 
the target variable, by subtracting off the sample mean and dividing by the sample standard 
deviation for each. As usual, a constant covariate is appended, making p = 72. We use the 
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body length (cm) k (# of nonzero coefficients) 




j (coefficient index) j (coefficient index) 


Figure 6. Variable selection for modeling birthweight. Upper left: Scatter- 
plot of birthweight (the target variable) versus body length at birth. Upper 
right: The standard posterior includes several more nonzero coefficients than 
the c-posteriors. Lower left: Posterior probability of inclusion for each coeffi¬ 
cient; only the top 16 are shown, see list below. Lower right: Posterior mean 
of each coefficient, for the same 16. (Top 16 variables: 1. Body length, 2. 
Mother’s weight at delivery, 3. Gestation time, 4. African-American, 5. Center 
6, 6. Center 2, 7. Center 3, 8. Mother’s weight prepregnancy, 9. Previously preg¬ 
nant, 10. Cigarettes per day, 11. ^ prenatal checkups, 12. Smoker/non-smoker, 
13. Mother’s BMI prepregnancy, 14. ^ previous pregnancies, 15. Triglyceride 
level, 16. Center 10.) 


same model parameters as in the simulation example. Rather than choose a single value of 
a, we explore the data at varying levels of coarseness, by considering a range of a values. 

For each a G {100, 500,1000,2000, oo}, we run the sampler for 10^ Gibbs sweeps, discard¬ 
ing the first 1000 sweeps as burn-in. The sampler is initialized by setting all the coefficients to 
zero; initializing with a sample from the prior yields identical results. To interpret a in terms 
of Euclidean notions, we estimate from posterior samples that A ~ 2.5 to 3, and thus, by 
the formula a ~ = 2/{\5‘^) derived above, the values of a above roughly correspond 

to allowing for misspecification/contamination of magnitude d G (0.09,0.04,0.03,0.02,0), 
respectively, or, when scaled to the original units, roughly dkg G {0.045,0.02,0.015,0.01,0} 
kilograms. 

The posterior on the number of nonzero coefficients k (Figure 6, upper right), and the 
posterior probability of inclusion (Figure 6, lower left), show that the standard posterior 
includes around 10 out of the 72 coefficients, while the c-posterior employs a more parsi¬ 
monious representation, depending on a. At a = 100 (dkg ~ 0.045), typically only a single 
variable is included, namely, body length. It makes sense that body length would be strongly 
predictive of weight, and the scatterplot in Figure 6 (upper left) confirms this. At a = 500 
(4g 0.02), both body length and mother’s weight at delivery are included, as well as gesta¬ 

tion time, with somewhat lower probability; again, it makes sense for these to be predictive 
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of birthweight. As a increases, additional variables are included to account for finer aspects 
of the data, until we reach the standard posterior at a = oo. 

All of the variables included by the standard posterior could conceivably be predictive of 
birthweight, although after adjusting for primary variables such as body length, it is possible 
that they are only being included due to misspecification. Since it seems likely that there 
would be misspecification at least at the 5kg = 0.01 kg level (i.e., ~ 0.02 pounds, a = 2000), 
the high probability placed on some of the additional variables included by the standard 
posterior is dubious, as is the precision with which it purports to infer the corresponding 
coefficients. Further, it seems inevitable that if n were larger, then even more coefficients 
would be included by the standard posterior. If there is misspecification, then as n grows, 
eventually the interpretation of which coefficients are included, and their values, becomes 
less related to practically significant associations and more related to the fact that the model 
is compensating for its limitations. 

7.3. Mixture models with a prior on the number of components. Consider a finite 
mixture model: 

k 

Ai,.. .,Xn\k,w,ip i.i.d. ~ E 

i=l 

and place a prior 7r(A:, w, Lp) on the number of components k, the mixture weights w, and the 
component parameters <p. This type of model is not robust to misspecification of the family 
of component distributions (/^ :(/?£$), resulting in negative consequences in practice, since 
we might reasonably expect the observed data Xi,... ,Xn to come from a finite mixture, but 
it is usually unreasonable to expect the component distributions to have a nice parametric 
form. We illustrate how the c-posterior enables one to perform inference for the number of 
components, as well as the mixture weights and the component parameters, in a way that 
is robust to misspecification of the form of the component distributions. This example also 
serves to demonstrate the use of Metropolis-Hastings MCMC for inference with a power 
posterior. 

We approximate the relative entropy c-posterior using the power posterior, defined as 

n k ^ 

TTcik, w, (p\xi.,n) OC Tr{k, W, if) n(E 

j=l i=l 

The usual approaches to inference in mixtures are based on latent variables indicating which 
component each datapoint comes from, but unfortunately, that does not seem to work here. 
There are, nonetheless, a few possible approaches to doing inference, for instance, Antoniano- 
Villalobos and Walker (2013) developed an auxiliary variable technique for posteriors of 
this form, and it would also be possible to use reversible jump MCMC (Green, 1995). To 
keep things as simple as possible, however, we assume an upper bound on k, say k < 
m, and reparameterize the model in a way that enables one to simply use plain-vanilla 
Metropolis-Hastings (MH) MCMC on a fixed-dimensional space. Specifically, we rewrite the 
mixture density as where Wi = fi'(L’i)/ XljLi g{v) = max{n — c, 0}, 

so that Wi = t) li Vi < c. Letting Vi,... ,Vm ~ Gamma(a,6) i.i.d., conditioned on the event 
that > 0) letting pi,... ,Prn be i.i.d., yields a mixture model in which the 

prior on the number of components k (that is, the number of nonzero weights) is Tr(k) oc 
Binomial(A:|m,p)l(fc > 0) where p = P(nj > c). 
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7.3.1. Skew-normal mixture example. To demonstrate robustness to the form of the com¬ 

ponent distributions, we consider a univariate Gaussian mixture model, applied to data 
generated i.i.d. from the two-component mixture 4,1, 5) -|- 1,2, 5), where 

s,a) is the skew-normal distribution (Azzalini and Capitanio, 1999) with location 
scale s, and shape a; see Figure 1 (top). For the model parameters, we assume an upper 
bound of m = 10 components, and dehne the prior on the component means and precisions 
as Hi ^ 5^) and log(Aj) ~ A/"(0, 2^) independently for i = 1,..., m, where the compo¬ 

nent densities are of the form fij,,\{x) = N'{x\fj„ A“^). We use a prior on k and w as in the 
preceding paragraph, with a = l/m, 6=1, and c such that p = P(ui > c) = 1/m. 

For the c-posterior, we set Cn = 1/(1+ n/Q;) (following Equation 3.4), and choose a = 100; 
this can be interpreted as saying that we want the posterior to behave as though at most 100 
samples are available. The top panel in Figure 1 illustrates that based on 100 samples, one 
can visually determine that there are two large groups of roughly equal size, and can roughly 
determine their location and scale, but cannot determine their precise form—in particular, 
one cannot tell that they are not actually Gaussian. 

For each n G {20,100, 500, 2000,10000}, hve independent datasets of size n were generated, 
and 10^ MH sweeps were performed for both the standard and coarsened posteriors. Each 
sweep consisted of MH moves on each (/ri,Aj) and Vi separately. Eigure 1 (left) shows the 
average of the approximated posteriors on k over these hve datasets. Eor typical samples 
from the standard and coarsened posteriors when n = 10000, Eigure 1 (right) shows the 
mixture density individual (weighted) components Wif^^^ii^)- 

As expected, since the data distribution cannot be represented as a hnite mixture of 
Gaussians, the standard posterior introduces more and more components as n increases, 
in order to obtain an adequate ht to the data. Meanwhile, in accordance with our visual 
intuition that, based on 100 samples, there appear to be two large groups, the a = 100 
c-posterior on k shows strong support for two components, no matter how large n becomes. 

The typical sample from the standard posterior provides a better ht to the distribution of 
the data, however, it has several more than two components, obscuring the fact that there 
are two large groups. On the other hand, the typical sample from the c-posterior does not ht 
the data distribution as well, but it clearly indicates that there are two groups, and provides 
an interpretable representation of their locations and scales. 

7.3.2. Shapley galaxy dataset. The galaxy dataset of Roeder (1990) is a classic benchmark 
for nonparametric mixture models, but it is somewhat outdated, and rather small, with 
only n = 82. Drinkwater et ah (2004) provide a more recent and larger dataset of the 
same type, consisting of the velocities of 4215 galaxies in the Shapley supercluster, a large 
concentration of gravitationally-interacting galaxies; see Figure 7. The clustering tendency 
of galaxies continues to be a subject of interest in astronomy. However, due to the hlament- 
like nature of the distribution of galaxies, it seems likely that any such clusters will not be 
Gaussian. 

Nonetheless, with the c-posterior approach, a Gaussian mixture model can be used to 
good effect, to identify clusters that are approximately normal. By varying the coarsening 
parameter a, one can explore the data at varying levels of precision, allowing for greater or 
smaller departures from normality. 

We use the same model as above, but with m = 20 and a data-dependent choice of prior 
parameters: Hi ^ A/'(t, d^) and log(Aj) ~ A/'(log(4/d^), 2^). For a G (20,100, 500,1000}, we 
run the sampler for 10^ MH sweeps, with a burn-in of 10^ sweeps. For the standard posterior 
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Figure 7. Gaussian mixture with a prior on the number of components k, 
applied to the Shapley galaxy data. Top left: Histogram of the data, in units of 
1,000 km/s, excluding a small amount of data extending in a tail up to 80,000 
km/s. Top right: C-posterior on k for a range of a values; a = cx) is the 
standard posterior. Middle and bottom: Mixture density (dotted black line) 
and components (solid colors) for prototypical samples from the c-posterior, 
for a range of a values. 


(a = oo), mixing is considerably slower; to compensate, we use 10® sweeps with a burn-in of 
2.5 X 10®. This illustrates how inference can be easier under the c-posterior. 

As shown in Figure 7, when a is small, the c-posterior tolerates greater departures from 
normality, and uses a smaller number of components to represent the data. For instance, 
from a glance at the histogram, one can visually distinguish three or four large groups which 
appear roughly unimodal, and when a is around 100 to 500, samples from the c-posterior 
tend to provide a mixture representation that corresponds well to these intuitive groups. For 
larger values of a, additional mixture components are employed, to account for finer and hner 
grained aspects of the data distribution. By the time a = oo, i.e., the standard posterior, 
the large-scale structures have mostly been fragmented into many small components. 

Of course, in a univariate setting like this, one can already visually see the large-scale 
groups, but clusters in high-dimensional data are not so easy to visualize, and having a tool 
like the c-posterior to find structures at varying levels of precision may be very useful. In 
most applications, the primary use of mixture models is not density estimation, but rather. 












































ROBUST BAYESIAN INFERENCE VIA COARSENING 


27 


to provide an interpretable summary of the data in terms of clusters, and in these cases the 
c-posterior approach may have much to offer. 

8. Discussion 

The c-posterior approach seems promising as a general method of robust Bayesian in¬ 
ference, being both computationally efficient and conceptually well-grounded. There are a 
number of directions that would be interesting to pursue in future work. Further investi¬ 
gation of the small-sample correction is needed—in particular, assessing its accuracy, and 
justifying its use in non-discrete cases, if possible. We have focused on relative entropy due 
to the computational advantages, but it would be interesting to explore using other statisti¬ 
cal distances, particularly if fast inference methods could be developed for them as well. It 
would be nice if there was a way of inferring the coarsening parameter a from data, somehow; 
this would enable one not only to perform robust inference, but also to infer the amount of 
misspecification, and ideally, to achieve statistical efficiency when the model is correct. It 
would be beneficial, but perhaps difficult, if precise guarantees could be provided regarding 
frequentist coverage properties of the c-posterior, under misspecification. Finally, the scope 
of application of this way of thinking seems potentially larger than Bayesian inference; it 
might be interesting to explore adaptations to frequentist procedures. 

Appendix A. Proofs 

A.l. C-posterior for the toy example in Section 2. Letting Z = \{D{px\\px) < R), 
by Bayes’ theorem we have that for h G 

Ii{h\Z = 1) oc P(Z = l\h)U{h) (X F{Z = l\h) 

h h 

E{F{Z = h) I h) E{eM-(^D{p,\\px)) \ h) 

where (a) is since n(/j) = 1/2, and (b) is by the law of iterated expectations, and (c) by 
the fact that F(R > r) = exp(—or). This is easily computed exactly, since, letting S = 
= 'iT'Pxi^), we have AIHq ~ Binomial(n, 1/2) and AjHi ~ BetaBinomial(n, 1,1) = 
Uniform{0,1,... ,n}. To derive the approximation in Equation 2.1, we use Equation A.l 
below: 

n 

E{exp{-aD{p^\\px)) \ 0 ,h)^ a^jaexp{-anD{p^\\pg)) = c 

1=1 

where = XjiXjn + l/«), Pei^x) = Bernoulh(T|0) = for x G {0,1}, and c is a 

constant that does not depend on 9 ov h. If h = Hi, then this yields 

n 

F(Z = l|Hi) = E(E{exp{-aD{p^\\px)) \ e,Ri) | Hi) ^ e(c | Hi) 

i=l 

= c [ ^“”^(1 — ^0 = cB( 1 + anX, 1 -|- «„(! — x)). 

Jo 

If h = Ho, then 6 = 1/2 with probability 1, so F{Z = l|Ho) ~ c(l/2"”/”)"' = c/2"". Thus, 
n(Ho|Z = 1) = F{Z = l|Ho)/(P(Z = l|Ho) + F{Z = l|Hi)) « 1/(1 + 2""H(1 + a^x, 1 + 
an{l - x)). 
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A. 2. Justification of the small-sample correction when \X\ < oo. Let Ak = {p E 

• SjPi = 1) Pi > 0 Vz}. Let s G Afc. We argue that if W,..., i.i.d. ~ s and 
= k £r=i = j) for j = 1,..., k, then for p E Ak near s, 

(A.l) Eexp(-Q;D(p||s)) (nCn/«)^ exp(-nCnA>(p||s)), 


where (n = + !/«). We use bold here to denote random variables. For x E 

define C(x) E such that C{x)ij = Xil{i = j) — XiXj, and denote x' = (xi,... ,Xci-i). 

First, for q E A^ near p, 

(A.2) D{p\\q) ^ lx\p, Q) = W- Q7C{qr\p' - q') 

by Propositions A.l and A.2 below. By the central limit theorem, s is approximately 
A/'(s,C'(s)/n) distributed. Therefore, letting q ^ Af{s,C{s)/n) and C = C{s'), 

Eexp(— q;L)(p||s)) Eexp(—Q;L>(p||qr))l(q G A^,) 

Eexp - |(p' - q'yC~\p' - q')^ 

= ( 27 r)~ |C'/q;|^A j J\f(p'\q', C/a)J\f{q'\s', C/n)dq' 


= ( 27 r)^ 2 ^ \C/a\^^'^J\f{p'\s', (l/o + l/n)C) 

= (T7GLlAi) ’ '=>=p(-l(i/« + i/«r'(p' 

£ (nCn/a) V exp(-nC„-D(p||s)), 


s'rC-Hp’ - /)) 


where (a) is by Equation A.2 along with the approximation C{q') ~ (b) uses the 

convolution formula for independent normals, and (c) is again by Equation A.2. This yields 
Equation A.l. 

It is well-known that chi-squared distance is a second-order Taylor approximation to rel¬ 
ative entropy (Cover and Thomas, 2006, Lemma 17.3.3); for completeness, we include the 
proof. 


Proposition A.l. For p,q E A^, D(p\\q) = |x^(PW) + o(||p — g|p) as p ^ q, where 

D{p\\q) = EiPifog(P*/9i) o-nd X^{P,q) = Ei(P* “ 

Proof. Fix b > 0, and define /(a) = alog(a/6) for a > 0. Then by Taylor’s theorem, 
f{a) = fib) + f{b){a -b) + ir (6)(a - b)^ + o(|a - b\^) 

= (a — b) + -- — -j-P + o(|a - b\y 
as a ^ b. It follows that 

J^Pi fog 7 = J2^Pi -qi) + lYl ^ + ^(IIp “ d) + o(l|p - q\\y 

J 1 J ^ J 

1=1 I I 

as p ^ q. □ 

The following result expresses the chi-squared distance x^iP^Q) fo terms of the (k — 1)- 
dimensional Mahalanobis distance for Z' when Z ~ Multinomial(1, g). Eor interpretation, 
note that C below equals Cov(Z') when Z ~ Multinomial(l, g). 
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Proposition A.2. For anyp,q G = ip'~Q'yC ^{p' — q') where C G 

such that Cij = gil(z = j) — qiqj- 

Proof. By the Sherman-Morrison formula for rank-one updates, C~^ = (diag(g') — q'q'^)~^ = 
diag(g')“^ -|- (l/g^)!!^ where 1 = (1,..., 1)^, hence 

(p’ - q’yc-Hp' - ?') = E kciA + 

“ qi Qk 

and {Pi - Qi) = (1 “ Pk) - ^ - Qk) = Qk - Pk- □ 


A.3. Proofs of results in Theory section. 


Proof of Lemma 5.1. Since P(f/ = V) = 0, we have l(t4i < V) \ {U < V^), and thus, also 
W\{Un < V) W'1{U < V). Hence, by the dominated convergence theorem (Breiman, 
1968, 2.44), F{Un <V) — ^ F{U < V) and 

E{Wl{Un < V)) E{W1{U < V)) 

since 0 < !(•) < 1, \Wl{Un < H)| < |1T|, and E|1T| < oo. By assumption, F{U < H) > 0, 
hence F{Un < H) > 0 for all n sufficiently large, and 


E{W\Un < V) 


E{Wl{Un < H)) ^ E{W1{U < V)) 

F{Ur, < V) F{U < V) 


E{W\U < V). 


□ 


Proof of Theorem 5.3. We apply Lemma 5.1 with U = d{P0, Po), Un = dn{Xi.n, xi:„), V — R, 
and W = h{d). By assumption, Un U, F{U = V) — 0, P([/ < H) > 0, and E|1T| < oo. 
Hence, by Lemma 5.1, 


E(W \Un<V) —^ E(W \ u <V) = 


E{W1{U < V)) 


P(17 < V) 

E{WE{1{U < V)\W,U)) _ E{WG{U)) 


E(P([/ <V\U)) 


EG{U) 


since V L U,W by construction. This establishes Equation 5.2, and since in particular this 
holds for any bounded continuous h, Equation 5.1 follows. □ 


Proof of Corollary 5.5. Since Xi,..., Xn\9 i.i.d. ~ Pe and xi,... ,Xn behaves like an i.i.d. 
sequence from Po, then Pxi,n Pe Pxi,n Po (Dudley, 2002, Theorem 11.4.1). 
Hence, dn{Xi,n,Xi:n) d{P 0 ,Po), and Theorem 5.3 applies. □ 

Proof of Theorem 5.6. Apply Lemma 5.1 with U = d{P 0 , Po), Um = d{P 0 , Pm), V = R, and 
W = h{0). □ 
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